From Latent Semantic Indexing to Language Models and Back

Author

Thomas Hofmann

Abstract

One of the key challenges in information retrieval is the problem of automated indexing. How can computers be used to automatically extract relevant index terms from documents? How should documents be represented to facilitate information access? Primarily, a good document representation should capture the topical and semantical relationships between documents. Thereby, it should support the computation of similarities between documents and queries or other documents. From the early years of information retrieval, it has been realized that automated indexing should get to the semantic level of the meaning of words. An important example is the idea of notional families in the work of H.P. Luhn [5]. Ideally, notional families group together words of similar and related meaning and use these concepts to encode documents. In this paper, we will put our own work [4] in a new context and show how a combination of ideas from latent semantic indexing [2] with the language modeling approach to information retrieval [6] leads to a statistical retrieval model that is very close in spirit to notional families.

Similar resources

Combination of random indexing based language model and n-gram language model for speech recognition

This paper presents the results and conclusion of a study on the introduction of semantic information through the Random Indexing paradigm in statistical language models used in speech recognition. Random Indexing is an alternative to Latent Semantic Analysis (LSA) that addresses the scalability problem of LSA. After a brief presentation of Random Indexing (RI), this paper describes, different ...

Indexing Audio Documents by using Latent Semantic Analysis and SOM

This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated and tested. The idea is to extract extra information from the structure of the document collection an...

Improved Chinese Spoken D with Hybrid Modeling and D Feature

Different models retrieve the documents based on different approaches of extracting the underlying content. Different levels of indexing features also offer different functionalities and discriminabilities when retrieving the documents. In this paper, we present results for Chinese spoken document retrieval with hybrid models to integrate the knowledge obtainable from three basic retrieval mode...

Log-linear models and latent semantic indexing applied to MWE identification

A short introduction characterizes the task of identification of multiword expressions and their idiosyncratic properties. Then, this document gives a detailed description of log-linear models and latent semantic analysis. The description enumerates components of the models, estimation techniques for the model parameters and addresses the interpretation of the models and their evaluation. We als...
